Abstract

In this paper we study the convex problem of optimizing the sum of a smooth function and 
a compactly supported non-smooth term with a specific separable form. We analyze the block 
version of the generalized conditional gradient method when the blocks are chosen in a cyclic 
order. A global sublinear rate of convergence is established for two different stepsize strategies 
commonly used in this class of methods. Numerical comparisons of the proposed method to 
both the classical conditional gradient algorithm and its random block version demonstrate the 
effectiveness of the cyclic block update rule. 

Keywords: Conditional gradient, cyclic block decomposition, iteration complexity, linear oracle, 
nonsmooth convex minimization, support vector machine. 


1 Introduction 

With the growing size of problems commonly encountered in many applied fields, there is a strong
demand for numerical methods whose iterations have a low computational cost. By low computational
cost, we mean algorithms which require at most matrix-vector multiplications (matrix
inversion, for example, is too expensive). In this context, it is necessary to propose and analyze
numerical schemes that

• are based on computationally efficient steps; 

• exploit problem structure and data information; 

• enjoy global convergence properties and iteration complexity estimates. 

We consider structured convex problems consisting of minimizing the sum of two terms: a smooth
term, which is the composition of a smooth function and a linear mapping, and a nonsmooth separable
term. We focus on programs for which the geometry of the nonsmooth part is so complex
that proximal-based methods [?] do not constitute a viable alternative. Indeed,
the efficiency of these methods deteriorates considerably in situations that do not fit a “favorable
geometric setting”; see [?] for a more detailed description of this concept. Algorithms based on
linear oracles, such as the conditional gradient method (also known as the Frank-Wolfe algorithm)
[12, 21, ?] and their extensions to structured composite problems [?], or block separable
problems [19], are based on the principle of iteratively solving linearized subproblems. In settings
where computing the proximal operator is too expensive, this approach has proven to be competitive.
Successful applications that benefit from this approach include trace-norm constrained
or penalized problems [?] and structured multiclass classification with an extremely large number
of classes [19]. On the theoretical side, for most of the algorithms mentioned in the references above,
a sublinear rate of $O(1/k)$, both in function values and duality gap, is available.

Continuously increasing problem dimensions have motivated the principle of taking advantage
of the available block structure of the problem at hand. This led to the development of variable
decomposition methods which break down the original large-scale problem into several much smaller
subproblems that can be solved efficiently.

In recent works, [23] and [?] analyzed the average case iteration complexity of block versions
of gradient, projected gradient and forward-backward methods where, at each iteration, the block
to be updated is selected at random. We refer to such a block selection rule as the random update
rule. In the context of linear oracles, [19] applied this approach to the conditional gradient method,
focusing on implementing it for the structured Support Vector Machine (SVM) training problem
(see more details in Section 6.2). Analysis of algorithms involving random update rules typically
provides average case complexity results. A different line of work considers updating the blocks in a
cyclic order. We refer to such a deterministic update rule as the cyclic update rule. In this context,
[22] provides an asymptotic analysis of exact coordinate minimization for composite strongly convex
problems. In [6], a global rate of convergence result was established for the cyclic block coordinate
gradient projection method for convex problems over feasible sets with a separable structure. In [28],
an explicit rate is given for the block proximal gradient method for $\ell_1$-regularized convex problems.
For this line of work, the estimates given for the cyclic update rule hold deterministically for the
sequence of function values. As already mentioned, this is not the case for the random update
rule, for which only average case estimates are available.

Another relevant feature of block decomposition methods is that they usually allow
a substantially larger step at each iteration (with respect to each block) compared to their
classical non-block counterparts (using, for example, a fixed stepsize, backtracking or line search).
This potentially gives a numerical advantage to block decomposition algorithms over
classical variants, see [6].

In this work, we propose a cyclic block version of the generalized conditional gradient method
[?], which for ease of reference is called the Cyclic Block Conditional Gradient (CBCG) method. The
word “generalized” means that we consider a more general nonsmooth part, which is not necessarily
an indicator function. The word “cyclic” means here that each block is updated once at each
iteration. The order in which the blocks are updated may vary arbitrarily between iterations, and
our analysis therefore includes the random permutation approach. We provide deterministic and
global convergence rate estimates for the predefined stepsize strategy [?], and for an adaptive
stepsize strategy [21], including a backtracking version of it. These rates hold independently of the
order in which the blocks are updated at each iteration. We also establish rate estimates for an optimality
measure in the spirit of [?], which constitutes an additional novelty compared to the results
presented in [6]. All the rates proved below have the form $O(1/k)$, where $k$ is the number of
complete iterations (over all blocks). Interestingly, this analysis leads to new convergence results
for methods that do not relate directly to linear oracles at first sight. For example, our results
lead to explicit deterministic rate estimates in terms of the duality gap for the cyclic and random
permutation variants of the Stochastic Dual Coordinate Ascent (SDCA) method of [29].

We numerically compare the proposed CBCG method (using both the cyclic and random permutation
updating rules) to its random update rule counterpart, that is, the Random Block Conditional
Gradient (RBCG) method, which was proposed and analyzed in [19]. Extensive simulations on a large
number of synthetic examples suggest that CBCG is competitive with both RBCG and the classical
conditional gradient algorithm (CG). Finally, we also compare CBCG and RBCG on the problem
of training the structured SVM [?] for the optical character recognition (OCR) task originally
proposed in [30]. In this setting, we observe that the random permutation updating rule has an
advantage over the other updating rules.
The next section is dedicated to the presentation of the model and main assumptions (see Section
2.1) and to the description of the CBCG algorithm (see Section 2.2). Section 3 presents a few auxiliary
results for an optimality measure which is commonly encountered when using methods
based on linear oracles. In Section 4 we present our theoretical findings on the rate of convergence
of the CBCG method. We split this section into subsections which deal with the two
different stepsize strategies that we analyze in this paper (Section 4.1 for the predefined stepsize and
Section 4.2 for the adaptive stepsize). We conclude Section 4 with a discussion on a backtracking
version of the CBCG method (see Section 4.3). Section 5 presents an extension of the analysis given
in Section 4.2 for the specific case where the stepsize is chosen using exact line search for problems
in which the smooth part of the objective function is quadratic. This leads to a discussion about the
implications for block coordinate descent methods. Numerical experiments on synthetic data and on
the structured SVM training problem are presented in Section 6.


Conventions. Throughout the paper the underlying vector space is the $n$-dimensional Euclidean
space $\mathbb{R}^n$ with the $\ell_2$-norm, which is denoted by $\|\cdot\|$. This notation is also used for the matrix norm,
which is assumed to be the spectral norm. We will consider the partition of an arbitrary vector
$x \in \mathbb{R}^n$ into $N$ blocks, where each block consists of a subset of the $n$ coordinates. The size of each
block (the number of coordinates) is given by the integer $n_i$ for $i = 1, 2, \ldots, N$, such that $\sum_{i=1}^{N} n_i = n$.
The $i$th block of a vector $x \in \mathbb{R}^n$ is denoted by $x_i$. We assume that $x \in \mathbb{R}^n$ can be written as follows

$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}.$$

For any $i = 1, 2, \ldots, N$ we define the matrix $U_i \in \mathbb{R}^{n \times n_i}$ as the sub-matrix of the $n \times n$ identity
matrix consisting of the columns corresponding to the $i$th block. Thus, in particular,

$$(U_1, U_2, \ldots, U_N) = I_n.$$

It is clear that using these notations, we have for any $x \in \mathbb{R}^n$ that $x_i = U_i^T x$ and $x = \sum_{i=1}^{N} U_i x_i$.
Finally, for any subset $S$ of $\mathbb{R}^n$, $\delta_S$ denotes the indicator function of $S$, which takes the value $0$ on $S$
and $+\infty$ otherwise.
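To make the block notation concrete, here is a minimal Python sketch. It assumes, for illustration only, that the blocks occupy consecutive coordinate ranges (the text allows arbitrary subsets), and the helper names `block_slices`, `get_block` and `embed_block` are hypothetical. Extracting a block plays the role of $U_i^T x$, and embedding a block into a zero vector plays the role of $U_i x_i$.

```python
def block_slices(sizes):
    """One slice per block, assuming blocks are consecutive coordinate ranges."""
    slices, start = [], 0
    for n_i in sizes:
        slices.append(slice(start, start + n_i))
        start += n_i
    return slices


def get_block(x, sl):
    """x_i = U_i^T x: keep only the coordinates of one block."""
    return x[sl]


def embed_block(x_i, sl, n):
    """U_i x_i: place a block into an otherwise zero vector of length n."""
    out = [0.0] * n
    out[sl] = x_i
    return out


sizes = [2, 1, 3]                          # n_1, n_2, n_3 with n = 6
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
slices = block_slices(sizes)
blocks = [get_block(x, sl) for sl in slices]

# Rebuild x as the sum of the embedded blocks, x = sum_i U_i x_i.
rebuilt = [0.0] * len(x)
for x_i, sl in zip(blocks, slices):
    rebuilt = [a + b for a, b in zip(rebuilt, embed_block(x_i, sl, len(x)))]
```

In practice one never forms $U_i$ explicitly; index ranges as above carry the same information.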


2 The Optimization Model and Algorithm 

2.1 Problem Formulation and Assumptions

We consider the optimization model

$$\min_{x \in \mathbb{R}^n} \Big\{ H(x) := F(Ax) + \sum_{i=1}^{N} g_i(x_i) \Big\}, \tag{2.1}$$

where $A \in \mathbb{R}^{m \times n}$. We make the following standing assumption on model (2.1).

Assumption 1.

(i) $g_i : \mathbb{R}^{n_i} \to (-\infty, \infty]$, $i = 1, 2, \ldots, N$, is a proper, closed and convex function which satisfies

- $X_i \equiv \operatorname{dom} g_i \subseteq \mathbb{R}^{n_i}$ is a compact set with diameter $D_i$, that is,

$$\sup_{x_i, y_i \in X_i} \|x_i - y_i\| = D_i.$$

- $g_i$ is globally Lipschitz on $X_i$ with constant $l_i$, that is,

$$|g_i(x_i) - g_i(y_i)| \le l_i \|x_i - y_i\|, \quad \forall\, x_i, y_i \in X_i. \tag{2.2}$$

(ii) $F : \mathbb{R}^m \to \mathbb{R}$ is convex and continuously differentiable¹ over $A(X_1 \times X_2 \times \cdots \times X_N) \subseteq \mathbb{R}^m$
and has Lipschitz continuous gradient with constant $L_F$, that is,

$$\|\nabla F(x) - \nabla F(y)\| \le L_F \|x - y\|, \quad \forall\, x, y \in A(X_1 \times X_2 \times \cdots \times X_N).$$


Since $g_i$ is assumed to be convex, it immediately follows that the domain $X_i$ is a convex subset of
$\mathbb{R}^{n_i}$ for $i = 1, 2, \ldots, N$. We set $g(x) = \sum_{i=1}^{N} g_i(x_i)$ and $f(x) = F(Ax)$. The domain of $g$ is denoted
by $\operatorname{dom} g = X$ and its diameter is denoted by $D$. Using these simplified notations, problem (2.1)
actually consists of minimizing the sum $f + g$. It holds that $X = X_1 \times X_2 \times \cdots \times X_N$ and therefore

$$D^2 = \sum_{i=1}^{N} D_i^2. \tag{2.3}$$

Remark 2.1. By setting $g(\cdot) = \delta_X(\cdot)$, we recover the constrained optimization model that motivated
the development of the traditional conditional gradient method (see [?] and references therein). This
is also the case when we add a linear term to the indicator.


Under Assumption 1, problem (2.1) is guaranteed to attain its optimal value, and therefore the
optimal set, which is denoted by $X^*$, is nonempty and the corresponding optimal value is denoted
by $H^* \in \mathbb{R}$. For each block $i \in \{1, 2, \ldots, N\}$, we employ the following notation:

• $A_i = A U_i$, so that $A = (A_1, A_2, \ldots, A_N)$;

• $\nabla_i f(x) = U_i^T \nabla f(x)$ denotes the partial gradient of $f$.

Using the previous notations, we have that $\nabla_i f(x) = A_i^T \nabla F(Ax)$. We will use a refined notion
of Lipschitz continuity that fits our block separable composite setting. This is expressed by the
following standing assumption.


Assumption 2. For each $i = 1, 2, \ldots, N$, there exists a constant $\beta_i > 0$ such that for any $x \in X$
and any $h_i \in \mathbb{R}^{n_i}$ satisfying $x + U_i h_i \in X$, it holds that

$$\|\nabla F(Ax + A_i h_i) - \nabla F(Ax)\| \le \beta_i \|A_i h_i\|.$$

Assumption 2 can be seen as a consequence of Assumption 1. Indeed, it is always possible to
set $\beta_i = L_F$, $i = 1, 2, \ldots, N$. However, adopting this more refined convention provides additional
algorithmic flexibility: it allows one to take advantage of conditioning disparities between different
blocks by using stepsizes which are functions of the Lipschitz constants of the blocks, rather than the
global Lipschitz constant. See for example Section 5.1, where it is shown how this approach allows
the usage of exact line search for quadratic problems. We also note that when $A = I$, then $\beta_i$ can be
chosen to be the $i$th block Lipschitz constant of the gradient of $F$ (see, e.g., [?]), which is never
larger than $L_F$. We define the following quantity


$$\beta_{\min} = \min\{\beta_1, \beta_2, \ldots, \beta_N\} > 0. \tag{2.4}$$


The following important result will play a central role in the forthcoming analysis. The proof is almost
identical to the well known proof of the descent lemma (see [?]), and is thus given in Appendix A.

Lemma 2.2 (Composite block descent lemma). Let $i \in \{1, 2, \ldots, N\}$. Then for any $x \in X$ and
$h_i \in \mathbb{R}^{n_i}$ such that $x + U_i h_i \in X$, we have

$$f(x + U_i h_i) \le f(x) + \langle \nabla_i f(x), h_i \rangle + \frac{\beta_i}{2} \|A_i h_i\|^2. \tag{2.5}$$

¹ A function is continuously differentiable over a given set $D$ if it is continuously differentiable over an open set
containing $D$.
2.2 The Cyclic Block Conditional Gradient (CBCG) Method

The generalized conditional gradient method [?] can be applied to problem (2.1) when the
corresponding linear oracle is available. Therefore we assume that for any $x \in X$, the solution of the
following problem can be easily computed:

$$\min_{v} \{ \langle \nabla f(x), v \rangle + g(v) \}.$$

We exploit here the separability of the function $g$ (see Section 2.1) and propose a block decomposition
extension of the generalized conditional gradient method which we call the Cyclic Block Conditional
Gradient (CBCG) method. Before stating the algorithm, we will need the following additional
notation. Let $\{x^k\}_{k \in \mathbb{N}}$ be a given sequence. Then for any $i = 0, 1, \ldots, N$, we define

$$x^{k,i} = \begin{pmatrix} x_1^{k+1} \\ \vdots \\ x_i^{k+1} \\ x_{i+1}^{k} \\ \vdots \\ x_N^{k} \end{pmatrix}. \tag{2.6}$$

That is, the first $i$ blocks in $x^{k,i}$ are those of $x^{k+1}$ and the remaining $N - i$ blocks are those of $x^k$. It
is clear that using this notation we have $x^{k,0} = x^k$ and $x^{k+1} = x^{k,N}$. The algorithm is given now.


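CBCG: Cyclic Block Conditional Gradient

Initialization. $x^0 \in X$ and $\alpha_i^k \in [0, 1]$ for all $k \in \mathbb{N}$ and $i = 1, 2, \ldots, N$.
General Step. For $k = 1, 2, \ldots$,

(i) For any $i = 1, 2, \ldots, N$, compute

$$p_i^k \in \operatorname{argmin}_{p_i \in X_i} \{ \langle \nabla_i f(x^{k,i-1}), p_i \rangle + g_i(p_i) \}, \tag{2.7}$$

and then

$$x^{k,i} = x^{k,i-1} + \alpha_i^k U_i (p_i^k - x_i^k). \tag{2.8}$$

(ii) Update $x^{k+1} = x^{k,N}$.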



We will first analyze, in Section 4.1, the convergence rate of the CBCG method using a predefined
stepsize [?]. Here, the predefined stepsize that we use is given, for any $i = 1, 2, \ldots, N$, by

$$\alpha_i^k = \alpha^k := \frac{2}{k+2}.$$

In Section 4.2, we will consider an adaptive stepsize rule [?], which is determined by the minimization
of the quadratic upper bound of $H$ related to (2.5). The expression of this stepsize will be made
precise below (see (4.12)). The backtracking variant of CBCG with the adaptive stepsize rule is
presented and analyzed in Section 4.3.
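The scheme with the predefined stepsize can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: it takes $F(y) = \frac{1}{2}\|y - b\|^2$ (so that $\nabla_i f(x) = A_i^T(Ax - b)$), lets each $g_i$ be the indicator of the box $[-1, 1]^{n_i}$ (whose linear oracle returns a vertex), and starts the sweep counter at $k = 0$ so that the first stepsize equals 1. The data `A`, `b`, `blocks` and all function names are hypothetical.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def objective(A, b, x):
    """f(x) = 0.5 * ||A x - b||^2; the box indicator is 0 at feasible points."""
    return 0.5 * sum((r - t) ** 2 for r, t in zip(matvec(A, x), b))


def partial_gradient(A, b, x, block):
    """Partial gradient A_i^T (A x - b) on the coordinates listed in `block`."""
    res = [r - t for r, t in zip(matvec(A, x), b)]
    return [sum(A[row][j] * res[row] for row in range(len(A))) for j in block]


def box_oracle(grad_i):
    """Linear oracle over [-1, 1]^{n_i}: a vertex minimizing <grad_i, p>."""
    return [-1.0 if gj > 0 else 1.0 for gj in grad_i]


def cbcg_predefined(A, b, x0, blocks, sweeps):
    x = list(x0)
    for k in range(sweeps):
        alpha = 2.0 / (k + 2)                      # same stepsize for every block
        for block in blocks:                       # one cyclic pass over the blocks
            g = partial_gradient(A, b, x, block)   # gradient at the partially updated point
            p = box_oracle(g)
            for j, pj in zip(block, p):
                x[j] += alpha * (pj - x[j])        # convex-combination step on this block only
    return x


# Hypothetical data: b = A x_true with x_true = (0.5, -0.25, 0.25, -0.5) inside the box,
# so the optimal value is 0.
A = [[2.0, 1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0, 2.0],
     [1.0, 0.0, 0.0, 1.0]]
b = [0.75, 0.0, -0.75, 0.0]
blocks = [[0, 1], [2, 3]]
x0 = [1.0, 1.0, 1.0, 1.0]

x_final = cbcg_predefined(A, b, x0, blocks, sweeps=500)
```

Each block update is a convex combination of the current block and an oracle output, so the iterates stay feasible without any projection. The partial gradient is evaluated at the partially updated point, which is what distinguishes the cyclic scheme from a full conditional gradient step.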


3 The Optimality Measure 

In this section, we describe a few properties of an optimality measure, including its block counterparts.
This measure is typical when discussing methods which are based on linear oracles and usually plays
a crucial role in the convergence analysis, see [1] for an overview and [?] for a link with Fenchel
duality. For any $x \in X$, we define the following quantity

$$p(x) \in \operatorname{argmin}_{p \in X} \{ \langle \nabla f(x), p \rangle + g(p) \}, \tag{3.1}$$

as well as the optimality measure

$$S(x) = \max_{p \in X} \{ \langle \nabla f(x), x - p \rangle + g(x) - g(p) \} = \langle \nabla f(x), x - p(x) \rangle + g(x) - g(p(x)), \tag{3.2}$$

where the last equality follows from (3.1). The function $S$ is an optimality measure in the sense that
it is non-negative on $X$, and it is zero only on $X^*$. Furthermore, for any $x \in X$, the quantity $S(x)$
is an upper bound on $H(x) - H^*$, as stated in the following lemma whose short proof is given for the
sake of completeness.

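Lemma 3.1. For any $x \in X$, we have $S(x) \ge H(x) - H^*$.

Proof. For any $x^* \in X^*$ we have

$$\begin{aligned}
S(x) &= \langle \nabla f(x), x - p(x) \rangle + g(x) - g(p(x)) \\
&= \langle \nabla f(x), x \rangle + g(x) - [\langle \nabla f(x), p(x) \rangle + g(p(x))] \\
&\ge \langle \nabla f(x), x \rangle + g(x) - [\langle \nabla f(x), x^* \rangle + g(x^*)] \\
&= \langle \nabla f(x), x - x^* \rangle + g(x) - g(x^*),
\end{aligned}$$

where the inequality follows from (3.1). Using the convexity of $f$, we obtain that

$$S(x) \ge \langle \nabla f(x), x - x^* \rangle + g(x) - g(x^*) \ge f(x) - f(x^*) + g(x) - g(x^*) = H(x) - H^*.$$

This proves the desired result.

□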
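The bound of Lemma 3.1 can be checked numerically on a toy instance. The sketch below is only an illustration under assumptions that are not part of the text: $f(x) = \frac{1}{2}\|x - c\|^2$ with a hypothetical vector `c`, and $g$ the indicator of the box $[-1, 1]^3$, so that the linear oracle returns a vertex and the minimizer of $H$ is the projection of `c` onto the box.

```python
def gradient(x, c):
    """Gradient of f(x) = 0.5 * ||x - c||^2."""
    return [xj - cj for xj, cj in zip(x, c)]


def H(x, c):
    """Objective value at a feasible point (the indicator contributes 0)."""
    return 0.5 * sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def oracle(g):
    """p(x): a vertex of [-1, 1]^n minimizing <g, p>."""
    return [-1.0 if gj > 0 else 1.0 for gj in g]


def optimality_measure(x, c):
    """S(x) = <grad f(x), x - p(x)>, since g(x) = g(p(x)) = 0 on the box."""
    g = gradient(x, c)
    p = oracle(g)
    return sum(gj * (xj - pj) for gj, xj, pj in zip(g, x, p))


c = [2.0, -0.3, 0.5]                                  # hypothetical data
x_star = [max(-1.0, min(1.0, cj)) for cj in c]        # projection of c onto the box
H_star = H(x_star, c)

points = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [-1.0, 0.5, -0.5], x_star]
gaps = [optimality_measure(x, c) for x in points]     # S(x)
errors = [H(x, c) - H_star for x in points]           # H(x) - H*
```

On this instance `gaps[j] >= errors[j]` at every test point, and the measure vanishes at the minimizer, which is why $S$ is a computable stopping criterion even when $H^*$ is unknown.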


We refine the notations introduced in (3.1) and (3.2) in order to fit them to our block structured
setting. For any $x \in X$ and any $i = 1, 2, \ldots, N$, we set

$$p_i(x) \in \operatorname{argmin}_{p_i \in X_i} \{ \langle \nabla_i f(x), p_i \rangle + g_i(p_i) \}, \tag{3.3}$$

and define the block optimality measure, which was already introduced in [?] when $g_i = 0$,

$$S_i(x) = \max_{p_i \in X_i} \{ \langle \nabla_i f(x), x_i - p_i \rangle + g_i(x_i) - g_i(p_i) \}. \tag{3.4}$$

It is clear that in this case it is also true that

$$S_i(x) = \langle \nabla_i f(x), x_i - p_i(x) \rangle + g_i(x_i) - g_i(p_i(x)). \tag{3.5}$$

Using the separability of both $g$ and $X$, we have for any $x \in X$ that

$$S(x) = \sum_{i=1}^{N} S_i(x). \tag{3.6}$$

There might be multiple optimal solutions for problem (3.1) and also for problem (3.3). Our only
assumption is that the choices of $p_1(x), p_2(x), \ldots, p_N(x)$ and $p(x)$ are made under the restriction
that

$$p(x) = \begin{pmatrix} p_1(x) \\ p_2(x) \\ \vdots \\ p_N(x) \end{pmatrix}.$$

The following Lipschitz-type property of the block optimality measure $S_i$, $i = 1, 2, \ldots, N$, will be
crucial in the forthcoming analysis.

Lemma 3.2. Let $x, y \in X$ be two vectors which satisfy $x_i = y_i$ for some $i = 1, 2, \ldots, N$. Then the
following inequality holds

$$|S_i(x) - S_i(y)| \le L_F D_i \|A_i\| \cdot \|A(x - y)\|.$$

Proof. From (3.5) we have

$$\begin{aligned}
S_i(x) &= \langle \nabla_i f(x), x_i - p_i(x) \rangle + g_i(x_i) - g_i(p_i(x)) \\
&= \langle \nabla_i f(y), x_i - p_i(x) \rangle + g_i(x_i) - g_i(p_i(x)) + \langle \nabla_i f(x) - \nabla_i f(y), x_i - p_i(x) \rangle.
\end{aligned} \tag{3.7}$$

Now, using the fact that $f(x) = F(Ax)$ and Assumption 1(ii), we obtain

$$\begin{aligned}
\langle \nabla_i f(x) - \nabla_i f(y), x_i - p_i(x) \rangle &= \langle A_i^T (\nabla F(Ax) - \nabla F(Ay)), x_i - p_i(x) \rangle \\
&= \langle \nabla F(Ax) - \nabla F(Ay), A_i (x_i - p_i(x)) \rangle \\
&\le \|\nabla F(Ax) - \nabla F(Ay)\| \cdot \|A_i (x_i - p_i(x))\| \\
&\le L_F \|A(x - y)\| \cdot \|A_i (x_i - p_i(x))\| \\
&\le L_F D_i \|A_i\| \cdot \|A(x - y)\|,
\end{aligned} \tag{3.8}$$

where the first inequality follows from the Cauchy-Schwarz inequality and the last inequality follows
from the fact that both $x_i$ and $p_i(x)$ belong to $X_i$ (see Assumption 1(i)). Finally, by combining (3.7)
with (3.8), and using the fact that $x_i = y_i$, we obtain that

$$\begin{aligned}
S_i(x) &\le \langle \nabla_i f(y), x_i - p_i(x) \rangle + g_i(x_i) - g_i(p_i(x)) + L_F D_i \|A_i\| \cdot \|A(x - y)\| \\
&= \langle \nabla_i f(y), y_i - p_i(x) \rangle + g_i(y_i) - g_i(p_i(x)) + L_F D_i \|A_i\| \cdot \|A(x - y)\| \\
&\le S_i(y) + L_F D_i \|A_i\| \cdot \|A(x - y)\|,
\end{aligned} \tag{3.9}$$

where the last inequality follows from the definition of $S_i$ (see (3.4)). Changing the roles of $x$ and $y$,
we also obtain that

$$S_i(y) \le S_i(x) + L_F D_i \|A_i\| \cdot \|A(x - y)\|,$$

which along with (3.9) yields the desired result.

□


4 Convergence Analysis of the CBCG Method 

This section is devoted to the convergence analysis of the CBCG algorithm. We will first prove, in
Section 4.1, a sublinear convergence rate for the variant with the predefined stepsize rule. A similar
rate of convergence will then be established, in Section 4.2, for the variant with the adaptive stepsize
rule. Finally, we describe a backtracking procedure in Section 4.3 which allows one to use the CBCG
method when the constants $\beta_i$, $i = 1, 2, \ldots, N$, given in Assumption 2 are unknown in advance. We
begin with an extension of Lemma 2.2 that holds for any choice of stepsize.

Lemma 4.1. Let $\{x^k\}_{k \in \mathbb{N}}$ be the sequence generated by the CBCG method. Then, for any $k \ge 0$
and $i \in \{1, 2, \ldots, N\}$, we have

$$H(x^{k,i}) \le H(x^{k,i-1}) - \alpha_i^k S_i(x^{k,i-1}) + \frac{(\alpha_i^k)^2 \beta_i}{2} \|A_i (p_i^k - x_i^k)\|^2.$$

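Proof. First, by the definition of the main step of the CBCG method (see (2.8)), we have

$$\begin{aligned}
H(x^{k,i}) &= f(x^{k,i}) + g(x^{k,i}) \\
&= f(x^{k,i-1} + \alpha_i^k U_i (p_i^k - x_i^k)) + g(x^{k,i-1} + \alpha_i^k U_i (p_i^k - x_i^k)).
\end{aligned}$$

We can now use Lemma 2.2 to obtain

$$H(x^{k,i}) \le f(x^{k,i-1}) + \alpha_i^k \langle \nabla_i f(x^{k,i-1}), p_i^k - x_i^k \rangle + \frac{(\alpha_i^k)^2 \beta_i}{2} \|A_i (p_i^k - x_i^k)\|^2 + g(x^{k,i-1} + \alpha_i^k U_i (p_i^k - x_i^k)). \tag{4.1}$$

The last term can be bounded from above as follows:

$$\begin{aligned}
g(x^{k,i-1} + \alpha_i^k U_i (p_i^k - x_i^k)) &= \sum_{j=1, j \ne i}^{N} g_j(x_j^{k,i-1}) + g_i(x_i^{k,i-1} + \alpha_i^k (p_i^k - x_i^{k,i-1})) \\
&\le \sum_{j=1, j \ne i}^{N} g_j(x_j^{k,i-1}) + (1 - \alpha_i^k) g_i(x_i^{k,i-1}) + \alpha_i^k g_i(p_i^k) \\
&= g(x^{k,i-1}) + \alpha_i^k (g_i(p_i^k) - g_i(x_i^{k,i-1})),
\end{aligned} \tag{4.2}$$

where the inequality follows from the convexity of each $g_i$, $i = 1, 2, \ldots, N$. Now, combining (4.1)
with (4.2) and using the definition of $S_i$ (see (3.5)), we obtain

$$\begin{aligned}
H(x^{k,i}) &\le f(x^{k,i-1}) + g(x^{k,i-1}) - \alpha_i^k \big( \langle \nabla_i f(x^{k,i-1}), x_i^{k,i-1} - p_i^k \rangle + g_i(x_i^{k,i-1}) - g_i(p_i^k) \big) + \frac{(\alpha_i^k)^2 \beta_i}{2} \|A_i (p_i^k - x_i^k)\|^2 \\
&= H(x^{k,i-1}) - \alpha_i^k S_i(x^{k,i-1}) + \frac{(\alpha_i^k)^2 \beta_i}{2} \|A_i (p_i^k - x_i^k)\|^2,
\end{aligned}$$

where we have used the fact that $x_i^k = x_i^{k,i-1}$ (see (2.6)). This proves the desired
result.

□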


We also need an explicit expression of the distance between two consecutive iterates generated
by the CBCG method. The following lemma also holds for any choice of stepsize.

Lemma 4.2. Let $\{x^k\}_{k \in \mathbb{N}}$ be the sequence generated by the CBCG method. Then, for any $k \in \mathbb{N}$,
we have

$$\sum_{i=1}^{N} (\alpha_i^k)^2 \|p_i^k - x_i^k\|^2 = \|x^{k+1} - x^k\|^2.$$

Furthermore, for any $i = 0, 1, \ldots, N$, we have

$$\|x^{k,i} - x^k\| \le \|x^{k+1} - x^k\|.$$

Proof. For any fixed $k \in \mathbb{N}$, we have from the definition of the iterates of the CBCG algorithm (see
(2.8)) and (2.6) that

$$\|x^{k+1} - x^k\|^2 = \sum_{j=1}^{N} \|x_j^{k+1} - x_j^k\|^2 = \sum_{i=1}^{N} (\alpha_i^k)^2 \|p_i^k - x_i^k\|^2,$$

which proves the first statement. Furthermore, since for any $i = 0, 1, \ldots, N$, $x_j^{k,i} = x_j^k$ for all $j > i$
and $x_j^{k,i} = x_j^{k+1}$ for all $j \le i$, we have

$$\|x^{k,i} - x^k\|^2 = \sum_{j=1}^{N} \|x_j^{k,i} - x_j^k\|^2 = \sum_{j=1}^{i} \|x_j^{k+1} - x_j^k\|^2 \le \sum_{j=1}^{N} \|x_j^{k+1} - x_j^k\|^2 = \|x^{k+1} - x^k\|^2,$$

which proves the second statement and the proof is complete.

□


4.1 Analysis for the Predefined Stepsize Strategy 

In this section, we analyze the CBCG algorithm with the predefined stepsize given by

$$\alpha_i^k = \alpha^k := \frac{2}{k+2}. \tag{4.3}$$


Note that this stepsize rule is completely blind to disparities between blocks and therefore cannot
counterbalance them in a way that would increase the convergence speed. The main step
towards the proof of the convergence rate in this case is recorded in the following lemma.

Lemma 4.3. Let $\{x^k\}_{k \in \mathbb{N}}$ be the sequence generated by the CBCG method with the stepsize given in
(4.3). Then, for any $k \ge 0$, we have

$$H(x^{k+1}) \le H(x^k) - \alpha^k S(x^k) + \frac{(\alpha^k)^2}{2} C_1,$$

where

$$C_1 = \sum_{i=1}^{N} \beta_i \|A_i\|^2 D_i^2 + 2 L_F D \|A\| \sum_{i=1}^{N} D_i \|A_i\|. \tag{4.4}$$

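Proof. For any $k \ge 0$ and each $i \in \{1, 2, \ldots, N\}$, we have from Lemma 4.1 that

$$\begin{aligned}
H(x^{k,i}) &\le H(x^{k,i-1}) - \alpha^k S_i(x^{k,i-1}) + \frac{(\alpha^k)^2 \beta_i}{2} \|A_i (p_i^k - x_i^k)\|^2 \\
&\le H(x^{k,i-1}) - \alpha^k S_i(x^{k,i-1}) + \frac{(\alpha^k)^2}{2} \beta_i \|A_i\|^2 D_i^2.
\end{aligned} \tag{4.5}$$

Summing (4.5) for $i = 1, 2, \ldots, N$ yields

$$\begin{aligned}
H(x^{k+1}) = H(x^{k,N}) &\le H(x^{k,0}) - \alpha^k \sum_{i=1}^{N} S_i(x^{k,i-1}) + \frac{(\alpha^k)^2}{2} \sum_{i=1}^{N} \beta_i \|A_i\|^2 D_i^2 \\
&= H(x^k) - \alpha^k \sum_{i=1}^{N} S_i(x^k) + \frac{(\alpha^k)^2}{2} \sum_{i=1}^{N} \beta_i \|A_i\|^2 D_i^2 + \alpha^k \sum_{i=1}^{N} \big( S_i(x^k) - S_i(x^{k,i-1}) \big) \\
&= H(x^k) - \alpha^k S(x^k) + \frac{(\alpha^k)^2}{2} \sum_{i=1}^{N} \beta_i \|A_i\|^2 D_i^2 + \alpha^k \sum_{i=1}^{N} \big( S_i(x^k) - S_i(x^{k,i-1}) \big),
\end{aligned} \tag{4.6}$$

where the last equality follows from (3.6). Using Lemma 3.2, and the fact that $x_i^{k,i-1} = x_i^k$ (see
(2.6)) gives, for any $i = 1, 2, \ldots, N$, that

$$S_i(x^k) - S_i(x^{k,i-1}) \le L_F D_i \|A_i\| \cdot \|A(x^k - x^{k,i-1})\|. \tag{4.7}$$

Combining (4.6) and (4.7) yields

$$H(x^{k+1}) \le H(x^k) - \alpha^k S(x^k) + \frac{(\alpha^k)^2}{2} \sum_{i=1}^{N} \beta_i \|A_i\|^2 D_i^2 + \alpha^k L_F \sum_{i=1}^{N} D_i \|A_i\| \cdot \|A(x^k - x^{k,i-1})\|. \tag{4.8}$$

From Lemma 4.2, for any $i = 1, 2, \ldots, N$, we have

$$\begin{aligned}
\|A(x^k - x^{k,i-1})\|^2 &\le \|A\|^2 \cdot \|x^k - x^{k,i-1}\|^2 \le \|A\|^2 \cdot \|x^k - x^{k+1}\|^2 \\
&= (\alpha^k)^2 \|A\|^2 \sum_{i=1}^{N} \|p_i^k - x_i^k\|^2 \le (\alpha^k)^2 \|A\|^2 \sum_{i=1}^{N} D_i^2 \\
&= (\alpha^k)^2 \|A\|^2 D^2.
\end{aligned} \tag{4.9}$$

Combining (4.8) and (4.9) yields

$$H(x^{k+1}) \le H(x^k) - \alpha^k S(x^k) + \frac{(\alpha^k)^2}{2} \Big[ \sum_{i=1}^{N} \beta_i \|A_i\|^2 D_i^2 + 2 L_F D \|A\| \sum_{i=1}^{N} D_i \|A_i\| \Big],$$

which proves the desired result. □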

We now recall the following technical lemma on sequences of non-negative numbers (cf. [?]).

Lemma 4.4. Let $\{a_k\}_{k \in \mathbb{N}}$ and $\{b_k\}_{k \in \mathbb{N}}$ be two sequences satisfying, for any $k \ge 0$, that

$$a_{k+1} \le a_k - t_k b_k + \frac{C}{2} t_k^2, \tag{4.10}$$

where $0 \le a_k \le b_k$, $t_k = 2/(k+2)$ and $C > 0$. Then

(a) $a_k \le 2C/(k+1)$;

(b) For any $n \ge 1$, we have

$$\min_{k = \lfloor n/2 \rfloor, \ldots, n} b_k \le \frac{8C}{n}.$$
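A quick numerical sanity check of this lemma is possible in the extreme case where (4.10) holds with equality and $b_k = a_k$, the smallest value of $b_k$ that the lemma allows. The sketch below is only an illustration: the constants `C` and `a0` are arbitrary, and bound (a) is checked from $k = 1$ onward because the recursion itself does not constrain $a_0$.

```python
def extreme_sequence(a0, C, steps):
    """Iterate a_{k+1} = a_k - t_k * a_k + (C / 2) * t_k**2 with t_k = 2 / (k + 2)."""
    a = [a0]
    for k in range(steps):
        t = 2.0 / (k + 2)
        a.append(a[-1] - t * a[-1] + 0.5 * C * t * t)
    return a


C, a0, steps = 3.0, 10.0, 200          # arbitrary illustrative constants
a = extreme_sequence(a0, C, steps)

# Part (a): a_k <= 2C / (k + 1), checked for k >= 1.
part_a_holds = all(a[k] <= 2 * C / (k + 1) + 1e-12 for k in range(1, steps + 1))

# Part (b) with b_k = a_k: the minimum over k = floor(n/2), ..., n is at most 8C / n.
part_b_holds = all(min(a[n // 2:n + 1]) <= 8 * C / n + 1e-12 for n in range(1, steps + 1))
```

Since $t_0 = 1$, the first step wipes out the dependence on $a_0$ and gives $a_1 = C/2$ regardless of the starting value.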

We are now ready to establish the sublinear rate of convergence of the CBCG algorithm with 
predefined stepsize. 

Theorem 4.5 (Sublinear rate for CBCG with predefined stepsize). Let $\{x^k\}_{k \in \mathbb{N}}$ be the sequence
generated by the CBCG method with the predefined stepsize strategy given in (4.3). Then, for any
$k \ge 0$, we have

$$H(x^k) - H^* \le \frac{2 C_1}{k+1}, \tag{4.11}$$

and, for any $n \ge 1$, we have

$$\min_{k = \lfloor n/2 \rfloor, \ldots, n} S(x^k) \le \frac{8 C_1}{n},$$

where

$$C_1 = \sum_{i=1}^{N} \beta_i \|A_i\|^2 D_i^2 + 2 L_F D \|A\| \sum_{i=1}^{N} D_i \|A_i\|.$$

Proof. From Lemma 4.3 we have

$$H(x^{k+1}) - H^* \le H(x^k) - H^* - \alpha^k S(x^k) + \frac{(\alpha^k)^2}{2} C_1.$$

In addition, by Lemma 3.1, we have that $S(x^k) \ge H(x^k) - H^*$. We can therefore invoke Lemma 4.4
with $a_k = H(x^k) - H^*$, $b_k = S(x^k)$ and $C = C_1$, and the desired result is established.

□
4.2 Analysis for the Adaptive Stepsize Strategy

We now turn to the analysis of CBCG with adaptive stepsize. The main insight is to choose a stepsize
that minimizes the quadratic upper bound given in Lemma 4.1. In this setting, the adaptive stepsize
is defined as follows:

$$\alpha_i^k = \operatorname{argmin}_{\alpha \in [0,1]} \Big\{ -\alpha S_i(x^{k,i-1}) + \frac{\alpha^2 \beta_i}{2} \|A_i (p_i^k - x_i^k)\|^2 \Big\} = \min \Big\{ \frac{S_i(x^{k,i-1})}{\beta_i \|A_i (p_i^k - x_i^k)\|^2}, 1 \Big\}. \tag{4.12}$$

Note that by Lemma 4.1 this choice of stepsize leads to a decrease of the objective value at each step.
The analysis employed for the predefined stepsize does not seem to be easily adjusted to the adaptive
stepsize rule. Thus, a different analysis is developed in this section. We begin with the following
technical result.
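Before moving on, here is a minimal sketch of how the adaptive rule (4.12) might look in code. It is an illustration under assumptions, not the authors' implementation: $F(y) = \frac{1}{2}\|y - b\|^2$, for which $\beta_i = 1$ is a valid constant, each $g_i$ is the indicator of the box $[-1, 1]^{n_i}$, and the data and function names are hypothetical. The list `history` records the objective after each sweep so that the per-step decrease noted above can be observed.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def objective(A, b, x):
    """H(x) = 0.5 * ||A x - b||^2 at a feasible point of the box."""
    return 0.5 * sum((r - t) ** 2 for r, t in zip(matvec(A, x), b))


def cbcg_adaptive(A, b, x0, blocks, betas, sweeps):
    x, m = list(x0), len(A)
    history = [objective(A, b, x)]
    for _ in range(sweeps):
        for block, beta in zip(blocks, betas):
            res = [r - t for r, t in zip(matvec(A, x), b)]                  # A x - b
            g = [sum(A[row][j] * res[row] for row in range(m)) for j in block]
            p = [-1.0 if gj > 0 else 1.0 for gj in g]                       # box linear oracle
            d = [pj - x[j] for j, pj in zip(block, p)]                      # p_i - x_i
            S_i = -sum(gj * dj for gj, dj in zip(g, d))                     # block optimality measure
            Ad = [sum(A[row][j] * dj for j, dj in zip(block, d)) for row in range(m)]
            sq = sum(v * v for v in Ad)                                     # ||A_i (p_i - x_i)||^2
            # Rule (4.12); if the quadratic term vanishes the bound is minimized at alpha = 1.
            alpha = 1.0 if sq == 0.0 else min(1.0, max(0.0, S_i / (beta * sq)))
            for j, dj in zip(block, d):
                x[j] += alpha * dj
        history.append(objective(A, b, x))
    return x, history


# Hypothetical data: b = A x_true with x_true = (0.5, -0.25, 0.25, -0.5) inside the box.
A = [[2.0, 1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0, 2.0],
     [1.0, 0.0, 0.0, 1.0]]
b = [0.75, 0.0, -0.75, 0.0]
blocks = [[0, 1], [2, 3]]

x_final, history = cbcg_adaptive(A, b, [1.0, 1.0, 1.0, 1.0], blocks, betas=[1.0, 1.0], sweeps=50)
```

For this quadratic choice of $F$ the upper bound of Lemma 4.1 is tight along the search segment, so the rule coincides with exact line search on each block.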

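Lemma 4.6. Let $\{x^k\}_{k \in \mathbb{N}}$ be the sequence generated by the CBCG method with the adaptive stepsize
strategy given in (4.12). Then, for any $k \ge 0$ and $i \in \{1, 2, \ldots, N\}$, we have

$$H(x^{k,i-1}) - H(x^{k,i}) \ge \frac{\alpha_i^k}{2} S_i(x^{k,i-1}) \ge \frac{(\alpha_i^k)^2 \beta_i}{2} \|A_i (p_i^k - x_i^k)\|^2.$$

Proof. We split the proof into two cases. First, if

$$\frac{S_i(x^{k,i-1})}{\beta_i \|A_i (p_i^k - x_i^k)\|^2} \ge 1, \tag{4.13}$$

then $\alpha_i^k = 1$, and by Lemma 4.1 we have

$$\begin{aligned}
H(x^{k,i-1}) - H(x^{k,i}) &\ge \alpha_i^k S_i(x^{k,i-1}) - \frac{(\alpha_i^k)^2 \beta_i}{2} \|A_i (p_i^k - x_i^k)\|^2 \\
&= S_i(x^{k,i-1}) - \frac{\beta_i}{2} \|A_i (x_i^k - p_i^k)\|^2
\end{aligned} \tag{4.14}$$

$$\ge \frac{\alpha_i^k}{2} S_i(x^{k,i-1}) \ge \frac{(\alpha_i^k)^2 \beta_i}{2} \|A_i (x_i^k - p_i^k)\|^2, \tag{4.15}$$

where the last two inequalities follow from (4.13) and the fact that $\alpha_i^k = 1$. In the second case,
when $S_i(x^{k,i-1}) < \beta_i \|A_i (p_i^k - x_i^k)\|^2$, we have that

$$\alpha_i^k = \frac{S_i(x^{k,i-1})}{\beta_i \|A_i (p_i^k - x_i^k)\|^2}.$$

Thus, by Lemma 4.1,

$$H(x^{k,i-1}) - H(x^{k,i}) \ge \frac{S_i(x^{k,i-1})^2}{2 \beta_i \|A_i (p_i^k - x_i^k)\|^2} = \frac{\alpha_i^k}{2} S_i(x^{k,i-1}) = \frac{(\alpha_i^k)^2 \beta_i}{2} \|A_i (p_i^k - x_i^k)\|^2. \tag{4.16}$$

The result now follows by combining (4.15) and (4.16).

□

We can now prove the following important result.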
Lemma 4.7. Let $\{x^k\}_{k \in \mathbb{N}}$ be the sequence generated by the CBCG method with the adaptive stepsize
strategy given in (4.12). Then, for any $k \ge 0$ and $i \in \{1, 2, \ldots, N\}$, we have

$$H(x^{k,i-1}) - H(x^{k,i}) \ge \frac{S_i(x^{k,i-1})^2}{2 \max\{\beta_i \|A_i\|^2 D_i^2, K_i\}},$$

where, for each $i = 1, 2, \ldots, N$,

$$M_i = \max_{x \in X} \|\nabla_i f(x)\|, \qquad K_i = (M_i + l_i) D_i. \tag{4.17}$$

Proof. We again split the proof into two cases. First, if $\alpha_i^k = 1$, then by Lemma 4.6 we have

$$H(x^{k,i-1}) - H(x^{k,i}) \ge \frac{1}{2} S_i(x^{k,i-1}) \ge \frac{S_i(x^{k,i-1})^2}{2 K_i},$$

where the last inequality follows from the fact that for any $x \in X$:

$$S_i(x) = \langle \nabla_i f(x), x_i - p_i(x) \rangle + g_i(x_i) - g_i(p_i(x)) \tag{4.18}$$

$$\le \|\nabla_i f(x)\| \cdot \|x_i - p_i(x)\| + l_i \|x_i - p_i(x)\| \tag{4.19}$$

$$\le (M_i + l_i) D_i. \tag{4.20}$$

In the second case, when $\alpha_i^k < 1$, we have that $\alpha_i^k = S_i(x^{k,i-1}) / (\beta_i \|A_i (p_i^k - x_i^k)\|^2)$, and by Lemma 4.6
we can write

$$H(x^{k,i-1}) - H(x^{k,i}) \ge \frac{S_i(x^{k,i-1})^2}{2 \beta_i \|A_i (p_i^k - x_i^k)\|^2} \ge \frac{S_i(x^{k,i-1})^2}{2 \beta_i \|A_i\|^2 D_i^2}. \tag{4.21}$$

The result now follows by combining (4.20) and (4.21).

□


We will now prove an additional technical lemma that establishes a “sufficient decrease” property 
of the CBCG method with adaptive stepsize. 

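Lemma 4.8. Let $\{x^k\}_{k \in \mathbb{N}}$ be the sequence generated by the CBCG method with the adaptive stepsize
strategy given in (4.12). Then, for any $k \ge 0$ and $i \in \{1, 2, \ldots, N\}$, we have

$$H(x^k) - H(x^{k+1}) \ge \frac{\beta_{\min}}{2N} \|A(x^k - x^{k,i-1})\|^2,$$

where $\beta_{\min}$ is given by (2.4).

Proof. By Lemma 4.6 we have

$$H(x^{k,i-1}) - H(x^{k,i}) \ge \frac{(\alpha_i^k)^2 \beta_i}{2} \|A_i (p_i^k - x_i^k)\|^2 \ge \frac{\beta_{\min}}{2} (\alpha_i^k)^2 \|A_i (p_i^k - x_i^k)\|^2. \tag{4.22}$$

Now, for any $i \in \{1, 2, \ldots, N\}$, we can write

$$\begin{aligned}
\|A(x^k - x^{k,i-1})\|^2 &= \Big\| \sum_{j=1}^{N} A_j (x_j^k - x_j^{k,i-1}) \Big\|^2 \\
&\le N \sum_{j=1}^{N} \|A_j (x_j^k - x_j^{k,i-1})\|^2 \\
&= N \sum_{j=1}^{i-1} \|A_j (x_j^k - x_j^{k+1})\|^2 \\
&= N \sum_{j=1}^{i-1} (\alpha_j^k)^2 \|A_j (x_j^k - p_j^k)\|^2 \\
&\le \frac{2N}{\beta_{\min}} \sum_{j=1}^{i-1} \big( H(x^{k,j-1}) - H(x^{k,j}) \big) \\
&= \frac{2N}{\beta_{\min}} \big( H(x^k) - H(x^{k,i-1}) \big) \\
&\le \frac{2N}{\beta_{\min}} \big( H(x^k) - H(x^{k+1}) \big),
\end{aligned}$$

where the first inequality follows from the fact that for any $N$ vectors $u_1, u_2, \ldots, u_N$, the inequality
$\| \sum_{j=1}^{N} u_j \|^2 \le N \sum_{j=1}^{N} \|u_j\|^2$ holds; the second inequality follows from (4.22), and the last inequality
follows from Lemma 4.6, which shows that the sequence of function values is non-increasing.

□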


Remark 4.9. The bound in Lemma 4.8 can be improved in some situations where additional
information on the structure of $A$ is available. For example, if the column spaces of the matrices $A_j$ are
mutually orthogonal, that is, $A_i^T A_j = 0$ for any $1 \le i < j \le N$, then the factor $N$ can be avoided.

The next lemma constitutes the crucial step towards the establishment of a sublinear convergence 
rate of the CBCG method with adaptive stepsize. 

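Lemma 4.10. Let $\{x^k\}_{k \in \mathbb{N}}$ be the sequence generated by the CBCG method with the adaptive stepsize
strategy given in (4.12). Then, for any $k \ge 0$, we have

$$H(x^k) - H(x^{k+1}) \ge \frac{1}{N C_2} S(x^k)^2,$$

where

$$C_2 = 4 \Big[ \max_{i=1,2,\ldots,N} \max\{\beta_i \|A_i\|^2 D_i^2, K_i\} + \frac{N L_F^2 D^2 \max_{i=1,2,\ldots,N} \|A_i\|^2}{\beta_{\min}} \Big], \tag{4.23}$$

and $\beta_{\min}$ is defined in (2.4) while $K_i$ is defined in (4.17), for $i = 1, 2, \ldots, N$.

Proof. For any $i \in \{1, 2, \ldots, N\}$ we have that

$$\begin{aligned}
S_i(x^k)^2 &\le 2 S_i(x^{k,i-1})^2 + 2 \big( S_i(x^k) - S_i(x^{k,i-1}) \big)^2 \\
&\le 2 S_i(x^{k,i-1})^2 + 2 L_F^2 D_i^2 \|A_i\|^2 \cdot \|A(x^k - x^{k,i-1})\|^2 \\
&\le 2 S_i(x^{k,i-1})^2 + \frac{4 N L_F^2 D_i^2 \|A_i\|^2}{\beta_{\min}} \big( H(x^k) - H(x^{k+1}) \big),
\end{aligned} \tag{4.24}$$

where the second inequality follows from Lemma 3.2 and the last inequality follows from Lemma 4.8.
Invoking Lemma 4.7, we obtain from (4.24) that

$$\begin{aligned}
S_i(x^k)^2 &\le 2 S_i(x^{k,i-1})^2 + \frac{4 N L_F^2 D_i^2 \|A_i\|^2}{\beta_{\min}} \big( H(x^k) - H(x^{k+1}) \big) \\
&\le 4 \max\{\beta_i \|A_i\|^2 D_i^2, K_i\} \big( H(x^{k,i-1}) - H(x^{k,i}) \big) + \frac{4 N L_F^2 D_i^2 \|A_i\|^2}{\beta_{\min}} \big( H(x^k) - H(x^{k+1}) \big).
\end{aligned} \tag{4.25}$$

Summing (4.25) for $i = 1, 2, \ldots, N$ yields

$$\begin{aligned}
\sum_{i=1}^{N} S_i(x^k)^2 &\le 4 \sum_{i=1}^{N} \max\{\beta_i \|A_i\|^2 D_i^2, K_i\} \big( H(x^{k,i-1}) - H(x^{k,i}) \big) + \frac{4 N L_F^2}{\beta_{\min}} \sum_{i=1}^{N} D_i^2 \|A_i\|^2 \big( H(x^k) - H(x^{k+1}) \big) \\
&\le 4 \max_{i=1,2,\ldots,N} \max\{\beta_i \|A_i\|^2 D_i^2, K_i\} \sum_{i=1}^{N} \big( H(x^{k,i-1}) - H(x^{k,i}) \big) + \frac{4 N L_F^2 D^2 \max_{i=1,2,\ldots,N} \|A_i\|^2}{\beta_{\min}} \big( H(x^k) - H(x^{k+1}) \big) \\
&= C_2 \big( H(x^k) - H(x^{k+1}) \big).
\end{aligned}$$

Finally, using (3.6) we have

$$S(x^k)^2 = \Big[ \sum_{i=1}^{N} S_i(x^k) \Big]^2 \le N \sum_{i=1}^{N} S_i(x^k)^2 \le N C_2 \big( H(x^k) - H(x^{k+1}) \big),$$

which proves the desired result.

□

By combining Lemma 3.1 with Lemma 4.10 we obtain the following corollary.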


Corollary 4.11. Let $\{x^k\}_{k \in \mathbb{N}}$ be the sequence generated by the CBCG method with the adaptive
stepsize strategy given in (4.12). Then, for any $k \ge 0$, we have

$$H(x^k) - H(x^{k+1}) \ge \frac{1}{N C_2} \big( H(x^k) - H^* \big)^2,$$

where $C_2$ is given in (4.23).

In order to obtain a rate of convergence result in this case we will need the following technical lemma
(cf. [6, Lemma 3.5]).

Lemma 4.12. Let $\{a_k\}_{k \in \mathbb{N}}$ be a sequence of non-negative real numbers satisfying, for any $k \in \mathbb{N}$,
the following property

$$a_k - a_{k+1} \ge \gamma a_k^2, \tag{4.26}$$

and $a_0 \le 1/(\gamma m)$ for some positive $\gamma$ and $m$. Then, for any $k \in \mathbb{N}$, we have that

$$a_k \le \frac{1}{\gamma} \cdot \frac{1}{m + k}.$$
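The decay promised by this lemma can be observed on the extreme sequence for which (4.26) holds with equality and $a_0 = 1/(\gamma m)$. The sketch below is only a numerical illustration with arbitrarily chosen constants `gamma` and `m`.

```python
def extreme_sequence(gamma, m, steps):
    """Iterate a_{k+1} = a_k - gamma * a_k**2 starting from a_0 = 1 / (gamma * m)."""
    a = [1.0 / (gamma * m)]
    for _ in range(steps):
        a.append(a[-1] - gamma * a[-1] ** 2)
    return a


gamma, m, steps = 0.5, 4, 1000          # arbitrary illustrative constants
a = extreme_sequence(gamma, m, steps)

# Bound of the lemma: a_k <= 1 / (gamma * (m + k)).
bound_holds = all(a[k] <= 1.0 / (gamma * (m + k)) + 1e-12 for k in range(steps + 1))
```

The mechanism behind the bound is that $1/a_{k+1} - 1/a_k = \gamma a_k / a_{k+1} \ge \gamma$ whenever the sequence is positive and non-increasing, so $1/a_k$ grows at least linearly in $k$.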


Combining Lemma 4.12 with Corollary 4.11 and Lemma 3.1 establishes the sublinear rate
of convergence of the CBCG method with adaptive stepsize.
Theorem 4.13 (Sublinear rate for CBCG with adaptive stepsize). Let $\{x^k\}_{k \in \mathbb{N}}$ be the sequence
generated by the CBCG method with the adaptive stepsize strategy given in (4.12). Then for any
$k \ge 0$ we have

$$H(x^k) - H^* \le \frac{N C_2}{k + 4}, \tag{4.27}$$

and

$$\min_{n = 0, 1, \ldots, k} S(x^n) \le \frac{2 N C_2}{k + 4},$$

where

$$C_2 = 4 \Big[ \max_{i=1,2,\ldots,N} \max\{\beta_i \|A_i\|^2 D_i^2, K_i\} + \frac{N L_F^2 D^2 \max_{i=1,2,\ldots,N} \|A_i\|^2}{\beta_{\min}} \Big],$$

$\beta_{\min}$ is defined in (2.4) and $K_i$ is defined in (4.17) for $i = 1, 2, \ldots, N$.

Proof. Denote $a_k = H(x^k) - H^*$. Thanks to Corollary 4.11 we get that (4.26) holds true with
$\gamma = 1/(N C_2)$. In addition, from Lemma 3.1 and (4.20), we have

\[
a_0 = H(x^0) - H^* \le S(x^0) = \sum_{i=1}^{N} S_i(x^0) \le \sum_{i=1}^{N} \max_{j = 1, 2, \dots, N} K_j \le \frac{N C_2}{4}.
\]

Hence, $a_0 \le 1/(4\gamma)$. Picking $m = 4$, we conclude from Lemma 4.12 that (4.27) holds. To find the
bound on the optimality measure $S(\cdot)$, from Lemma 4.10, we have for any $n \ge 0$

\[
H(x^n) - H(x^{n+1}) \ge \gamma S(x^n)^2.
\]

For any $k_0 \ge 0$, summing the latter inequality for $n = k_0, k_0 + 1, \dots, 2k_0$, we obtain that

\[
\gamma \sum_{n = k_0}^{2 k_0} S(x^n)^2 \le H(x^{k_0}) - H(x^{2k_0 + 1}) \le H(x^{k_0}) - H(x^*) \le \frac{1}{\gamma} \cdot \frac{1}{k_0 + 4},
\]

where the last inequality follows from (4.27). Hence,

\[
\min_{n = k_0, k_0 + 1, \dots, 2k_0} S(x^n)^2 \le \frac{1}{\gamma^2} \cdot \frac{1}{(k_0 + 1)(k_0 + 4)} \le \frac{1}{\gamma^2} \cdot \frac{1}{(k_0 + 2)^2}, \tag{4.28}
\]


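where we have used the fact that $(k_0 + 1)(k_0 + 4) \ge (k_0 + 2)^2$. Similarly, for any $k_0 \ge 0$,

\[
\gamma \sum_{n = k_0}^{2 k_0 + 1} S(x^n)^2 \le H(x^{k_0}) - H(x^{2k_0 + 2}) \le H(x^{k_0}) - H(x^*) \le \frac{1}{\gamma} \cdot \frac{1}{k_0 + 4}.
\]

Hence

\[
\min_{n = k_0, k_0 + 1, \dots, 2k_0 + 1} S(x^n)^2 \le \frac{1}{\gamma^2} \cdot \frac{1}{(k_0 + 2)(k_0 + 4)} \le \frac{1}{\gamma^2} \cdot \frac{1}{(k_0 + 5/2)^2}, \tag{4.29}
\]

where we have used the fact that $(k_0 + 2)(k_0 + 4) \ge (k_0 + 5/2)^2$. Note that $k_0 + 5/2 = (2k_0 + 1)/2 + 2$.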

Therefore, by combining (4.28) when $k$ is even and (4.29) when $k$ is odd, we conclude that

\[
\min_{n = 0, 1, \dots, k} S(x^n) \le \frac{1}{\gamma} \cdot \frac{1}{k/2 + 2} = \frac{2 N C_2}{k + 4},
\]

which concludes the proof.

□

4.3 Backtracking Version of CBCG


Computing the adaptive stepsize requires knowledge of the constants $\beta_i$, $i = 1, 2, \dots, N$. In practice, these
constants may not be known in advance or their known upper approximations may be too loose. A
common approach to overcome this problem is to use a backtracking scheme in order to estimate the
unknown constants that are required to ensure convergence of the algorithm. This strategy is also
effective in the context of CG-type algorithms.

The crucial property for the convergence analysis with adaptive stepsize is the block structured
descent condition given in Lemma 4.6. When the constants in Lemma 4.6 are not known, a workaround
is to enforce this descent condition algorithmically, which leads to the following scheme.


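CBCG-B: Cyclic Block Conditional Gradient with Backtracking
Initialization. $x^0 \in X$, $\beta_{\mathrm{init}} > 0$, $\kappa > 1$, $\beta_i^0 = \beta_{\mathrm{init}}$, $i = 1, 2, \dots, N$.

General Step. For $k = 1, 2, \dots$,

(i) For any $i = 1, 2, \dots, N$, compute

\[
p_i^k \in \operatorname{argmin}_{p_i \in X_i} \left\{ \langle \nabla_i f(x^{k,i-1}), p_i \rangle + g_i(p_i) \right\}. \tag{4.30}
\]

Find $\ell \in \mathbb{N}$, which is the smallest integer $\ell$ such that

\[
H\left( x^{k,i-1} + \sigma U_i (p_i^k - x_i^{k,i-1}) \right) \le H(x^{k,i-1}) - \sigma S_i(x^{k,i-1}) + \frac{\sigma^2 \kappa^{\ell} \beta_i^{k-1}}{2} \| A_i (p_i^k - x_i^{k,i-1}) \|^2, \tag{4.31}
\]

where $\sigma = \min\left\{ \dfrac{S_i(x^{k,i-1})}{\kappa^{\ell} \beta_i^{k-1} \| A_i (p_i^k - x_i^{k,i-1}) \|^2}, 1 \right\}$, and set

\[
\beta_i^k = \kappa^{\ell} \beta_i^{k-1}, \qquad \alpha_i^k = \min\left\{ \frac{S_i(x^{k,i-1})}{\beta_i^k \| A_i (p_i^k - x_i^{k,i-1}) \|^2}, 1 \right\}, \tag{4.32}
\]

\[
x^{k,i} = x^{k,i-1} + \alpha_i^k U_i (p_i^k - x_i^{k,i-1}). \tag{4.33}
\]

(ii) Update: $x^{k+1} = x^{k,N}$.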
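
A minimal Python sketch of a backtracking scheme in the spirit of CBCG-B is given below, for a box-constrained quadratic with blocks of size one and $A = I$. It assumes that the descent test is the quadratic upper bound with the trial constant $\kappa^{\ell}\beta_i^{k-1}$, that the search over $\ell$ starts at $\ell = 0$, and that a small numerical tolerance is acceptable in the test. The function names, the default parameters, the tolerance and the small test problem are illustrative and are not taken from the paper.

```python
def objective(Q, y, x):
    """f(x) = 0.5 * (x - y)^T Q (x - y)."""
    r = [xi - yi for xi, yi in zip(x, y)]
    n = len(r)
    return 0.5 * sum(r[i] * Q[i][j] * r[j] for i in range(n) for j in range(n))


def partial_grad(Q, y, x, i):
    """i-th partial derivative of f at x (Q is assumed symmetric)."""
    return sum(Q[i][j] * (x[j] - y[j]) for j in range(len(x)))


def cbcg_backtracking(Q, y, x0, beta_init=1e-3, kappa=2.0, sweeps=50, tol=1e-12):
    """Cyclic block CG with backtracking on the box [-1, 1]^d, one coordinate per block."""
    x = list(x0)
    beta = [beta_init] * len(x)              # running curvature estimates, one per block
    for _ in range(sweeps):
        for i in range(len(x)):
            g = partial_grad(Q, y, x, i)
            p = -1.0 if g > 0 else 1.0       # linear oracle over the interval [-1, 1]
            step = p - x[i]
            gap = g * (x[i] - p)             # block optimality measure, nonnegative
            if gap <= 0.0 or step == 0.0:
                continue                     # block i is already stationary
            h_old = objective(Q, y, x)
            while True:
                sigma = min(gap / (beta[i] * step * step), 1.0)
                trial = list(x)
                trial[i] = x[i] + sigma * step
                rhs = h_old - sigma * gap + 0.5 * sigma * sigma * beta[i] * step * step
                if objective(Q, y, trial) <= rhs + tol:
                    break                    # descent test passed with the current estimate
                beta[i] *= kappa             # otherwise increase the estimate and retry
            x = trial
    return x, beta


# Small illustrative instance (assumption): the constrained minimizer is (1, 0.2).
Q = [[2.0, 0.5], [0.5, 1.0]]
y = [2.0, -0.3]
x_final, beta_final = cbcg_backtracking(Q, y, [0.0, 0.0])
```

In this quadratic setting the descent test holds as soon as the estimate for block $i$ reaches $Q_{ii}$, so each estimate is increased only finitely many times and stays below $\max\{\kappa Q_{ii}, \beta_{\mathrm{init}}\}$.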


Under Assumption 2, we have from Lemma 4.6 that (4.31) holds provided that $\kappa^{\ell} \beta_i^{k-1} \ge \beta_i$. We
therefore have for any $k \ge 0$ and any $i = 1, 2, \dots, N$ that

\[
\beta_{\mathrm{init}} \le \beta_i^k \le \bar{\beta}_i \equiv \max\{ \kappa \beta_i, \beta_{\mathrm{init}} \}. \tag{4.34}
\]

The main insight is given in the following lemma, whose proof follows the same arguments as that
of Lemma 4.6 using (4.31).

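Lemma 4.14. Let $\{x^k\}_{k \in \mathbb{N}}$ be the sequence generated by the CBCG-B method. Then, for any $k \ge 0$
and $i \in \{1, 2, \dots, N\}$, we have

\[
H(x^{k,i-1}) - H(x^{k,i}) \ge \frac{\alpha_i^k}{2} S_i(x^{k,i-1}) \ge \frac{(\alpha_i^k)^2 \beta_i^k}{2} \| A_i (p_i^k - x_i^k) \|^2,
\]

where

\[
\alpha_i^k = \min\left\{ \frac{S_i(x^{k,i-1})}{\beta_i^k \| A_i (p_i^k - x_i^k) \|^2}, 1 \right\}.
\]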


Using the bounds in (4.34), we obtain the following rate of convergence result for the CBCG
algorithm with a backtracking scheme.

Theorem 4.15 (Sublinear rate for CBCG-B). Let $\{x^k\}_{k \in \mathbb{N}}$ be the sequence generated by the CBCG-B
method. Then, for any $k \ge 0$, we have

\[
H(x^k) - H^* \le \frac{N C_3}{k + 4},
\]

and

\[
\min_{n = 0, 1, \dots, k} S(x^n) \le \frac{2 N C_3}{k + 4},
\]

where

\[
C_3 = 4 \left( \max_{i = 1, 2, \dots, N} \max\left\{ \bar{\beta}_i \|A_i\|^2 D_i^2, K_i \right\} + \frac{N L_F^2 D^2 \max_{i = 1, 2, \dots, N} \|A_i\|^2}{\beta_{\mathrm{init}}} \right),
\]

and $\bar{\beta}_i$ is given in (4.34), while $K_i$, $i = 1, 2, \dots, N$, is the same constant as in Theorem 4.13.

Proof. The arguments of this proof are similar to those in the proof of Theorem 4.13, where we replace
Lemma 4.6 with Lemma 4.14. □


5 Further Discussions 


In this section we begin by showing that the complexity analysis presented in Section 4.2 holds 
for the CBCG method with exact line search when the smooth part of the objective function is 
quadratic. Then, we discuss two issues that are relevant for the implications of our analysis of the 
CBCG method. 


5.1 Exact Line Search for Quadratic Problems 

We recall another well known stepsize rule, the exact line search strategy (see, for example, [12]). This
stepsize is defined as follows:

\[
\alpha_i^k \in \operatorname{argmin}_{\alpha \in [0, 1]} H\left( x^{k,i-1} + \alpha U_i (p_i^k - x_i^{k,i-1}) \right). \tag{5.1}
\]


The minimization step (5.1) can incur unnecessary additional computations, unless it can be carried 


out efficiently. This is the case for quadratic functions for which it reduces to the minimization of a 
one-dimensional quadratic over a segment. It is therefore tempting to use CBCG with this stepsize 
rule, but our convergence analysis does not cover it explicitly. However, in the quadratic case, exact 
line search can be recast as an adaptive stepsize strategy. Indeed, suppose for example that the 
problem has the following form: 


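\[
\min_{x} \left\{ \frac{1}{2} \left\| \sum_{i=1}^{N} A_i x_i \right\|^2 + \sum_{i=1}^{N} \langle b_i, x_i \rangle + \sum_{i=1}^{N} g_i(x_i) \right\},
\]

where, for each $i$, $A_i$ is a matrix, $b_i$ is a vector of the same dimension as the block $x_i$, and $g_i$ satisfies the assumptions of model (2.1). This problem fits model (2.1) with
$F(\cdot) = \frac{1}{2} \| \cdot \|^2$, $A = (A_1, A_2, \dots, A_N)$ and nonsmooth terms $\langle b_i, \cdot \rangle + g_i(\cdot)$. Therefore, we can choose $\beta_i = 1$,
$i = 1, 2, \dots, N$, and the exact line search strategy is equivalent to our adaptive stepsize strategy given
in (4.12), since the upper bound of Lemma 4.1 holds with equality. Therefore, in the quadratic case,
convergence for the exact line search strategy follows from Theorem 4.13.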
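
The equivalence can be illustrated numerically. Along a block direction, a quadratic objective restricted to the segment takes the form $\phi(\alpha) = -\alpha S + \frac{1}{2} \alpha^2 c$, where $S \ge 0$ plays the role of the block optimality measure and $c \ge 0$ the role of $\|A_i(p_i^k - x_i^k)\|^2$ (with $\beta_i = 1$). The Python sketch below, whose names and test values are illustrative assumptions, computes the clipped stepsize $\min\{S/c, 1\}$ and compares it with a brute-force minimization of $\phi$ over a grid of $[0, 1]$.

```python
def phi(alpha, gap, curvature):
    """One-dimensional quadratic model along the block direction."""
    return -alpha * gap + 0.5 * alpha * alpha * curvature


def exact_line_search_step(gap, curvature):
    """Minimizer of phi over [0, 1]; assumes gap >= 0 and curvature >= 0."""
    if curvature <= 0.0:
        # phi is linear and nonincreasing in alpha: take the full step if it helps
        return 1.0 if gap > 0.0 else 0.0
    return min(gap / curvature, 1.0)


def grid_minimum(gap, curvature, points=1001):
    """Brute-force minimum value of phi over a uniform grid of [0, 1]."""
    return min(phi(j / (points - 1), gap, curvature) for j in range(points))


cases = [(0.5, 2.0), (3.0, 2.0), (0.0, 1.0)]   # illustrative (gap, curvature) pairs
steps = [exact_line_search_step(g, c) for g, c in cases]
```

In the first case the minimizer is interior, in the second the step is clipped at 1, and in the third the block is already optimal so the step is 0.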


5.2 Random permutations 

All the arguments presented in Section 4 remain valid if the order of the blocks is not fixed for
all iterations. The only important element is that all blocks are visited once at each iteration, but
the order could change, in an arbitrary way, at each iteration. In particular, the arguments are
valid when the order of blocks is picked as a random permutation at the beginning of each iteration.
Therefore, all the convergence rate estimates presented in Section 4 still hold true when using random
permutations.

The practical motivation for considering random permutations in the order of the blocks is that
it may be beneficial in practical settings [29]. Since the derived theoretical efficiency estimates apply
to any rule for choosing the order of the blocks at the beginning of each iteration (deterministic
or random), they do not explain the potential performance differences between different
variants, such as purely cyclic versus random permutations.
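
The difference between the block selection rules can be made concrete with a short Python sketch. The function name and the rule labels are hypothetical. The "cyclic" and "permutation" rules visit every block exactly once per iteration and are therefore covered by the analysis, whereas the "random" rule samples blocks independently with replacement, as in random block methods, and is shown only for contrast.

```python
import random


def block_order(num_blocks, rule, rng):
    """Return the sequence of block indices visited during one iteration."""
    if rule == "cyclic":
        # same fixed order at every iteration
        return list(range(num_blocks))
    if rule == "permutation":
        # a fresh random permutation: every block is still visited exactly once
        order = list(range(num_blocks))
        rng.shuffle(order)
        return order
    if rule == "random":
        # independent uniform sampling with replacement: some blocks may be skipped
        return [rng.randrange(num_blocks) for _ in range(num_blocks)]
    raise ValueError("unknown rule: " + rule)


rng = random.Random(0)                    # fixed seed for reproducibility
cyclic = block_order(5, "cyclic", rng)
permuted = block_order(5, "permutation", rng)
sampled = block_order(5, "random", rng)
```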


5.3 Implications for Coordinate Descent in Machine Learning 

In the case of blocks of size one (i.e., N = n), the CBCG method bears a strong resemblance to coordinate
descent type methods. For example, in dimension one, a conditional gradient step with exact line 
search is the same as exact minimization. Therefore, CBCG with blocks of size one and exact line 
search is equivalent to coordinate descent with exact minimization. As we have seen in the previous 
section, our convergence analysis holds for this setting in the case of quadratic objectives. This 
setting has some interesting applications. 
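
The one-dimensional equivalence can be checked on a strongly convex quadratic $\phi(t) = \frac{1}{2} a t^2 + b t$ over an interval. The Python sketch below is only an illustration under that assumption; the function names and the test values are hypothetical.

```python
def cg_step_exact_line_search(a, b, lo, hi, x):
    """One conditional gradient step with exact line search for
    phi(t) = 0.5 * a * t**2 + b * t on [lo, hi], with a > 0."""
    grad = a * x + b
    p = lo if grad > 0 else hi              # linear oracle over the interval
    direction = p - x
    if direction == 0.0:
        return x
    # unconstrained minimizer of alpha -> phi(x + alpha * direction), clipped to [0, 1]
    alpha = min(max(-grad / (a * direction), 0.0), 1.0)
    return x + alpha * direction


def exact_coordinate_minimization(a, b, lo, hi):
    """Exact minimizer of phi over [lo, hi]: the clipped unconstrained minimizer."""
    return min(max(-b / a, lo), hi)


# Illustrative cases (a, b, starting point): interior minimizer, upper bound, lower bound.
cases = [(2.0, -1.0, 0.3), (1.0, -5.0, 0.0), (1.0, 3.0, 0.5)]
cg_points = [cg_step_exact_line_search(a, b, -1.0, 1.0, x) for a, b, x in cases]
cd_points = [exact_coordinate_minimization(a, b, -1.0, 1.0) for a, b, _ in cases]
```

In each case a single conditional gradient step with exact line search lands on the exact minimizer over the interval, regardless of the starting point.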

The $\ell_2$-regularized empirical risk minimization problem is considered in [29], where the authors analyze
the stochastic coordinate descent method with exact minimization in the dual, a method referred
to as “Stochastic Dual Coordinate Ascent” (SDCA). The dual problem can be written as follows
(using our notations):

\[
\min_{x} \left\{ \frac{1}{n} \sum_{i=1}^{n} g_i(x_i) + \frac{\lambda}{2} \left\| \frac{1}{\lambda n} \sum_{i=1}^{n} x_i A_i \right\|^2 \right\}, \tag{5.2}
\]


where $\lambda > 0$ is a parameter and, for any $i = 1, 2, \dots, n$, $x_i \in \mathbb{R}$, $A_i$ is a vector, and $g_i$ is a closed,
convex and proper function with a bounded domain. This is exactly the setting of Section 5.1 and
therefore Theorem 4.13 gives explicit deterministic rate estimates of the duality gap for the cyclic and
random permutation coordinate descent variants applied to problem (5.2) when $g_1, g_2, \dots, g_n$ satisfy
the assumptions of model (2.1). It can be checked that in the setting of problem (5.2) (with $\lambda = 1$ for simplicity), we
get in Theorem 4.13 that as $n$ grows, the dominant term in the rate estimate (4.27) is of the form
$4 D^2 \max_i \|A_i\|^2 / (k + 4)$. The quantity $D^2$ remains proportional to $n$ (see identity (2.3)) and the rate
estimate of (4.27) still suffers from a multiplicative dependence in $n$, since each iteration requires
an effective pass through the $n$ coordinates. Therefore, these results remain mostly of theoretical
interest in the context of coordinate descent in machine learning, where the focus is on large $n$, and
there is room for improvement in order to theoretically grasp the practical performance of these
types of methods.


6 Numerical Experiments 


The experiments presented in this section correspond to situations where, for each $i$, the nonsmooth
function $g_i$ is taken to be the indicator of the set $X_i$, possibly with the addition of a
linear term. We therefore recover the smooth constrained optimization model of the traditional
conditional gradient method (see also Remark 2.1). Furthermore, for the cyclic selection rule, in all
the experiments, we use the “random permutation” approach, which consists in randomly changing
the order of the blocks at the beginning of each iteration. The convergence analyses described so
far do not depend on the order of the blocks and are still valid when it changes at each iteration.
Therefore, all the results of Section 4 hold deterministically for this “random permutation” rule,
which we do not differentiate from the cyclic update rule, and hence the corresponding algorithms
bear the same name in this section. We begin with a modeling remark in Section 5.1. Numerical
results are given in Section 6.1 for synthetic examples and in Section 6.2 for the structured SVM
training problem.


6.1 Synthetic Problems 

We generate artificial convex quadratic problems with box constraints in $\mathbb{R}^{100}$. We consider problems
of the following form

\[
\min_{\|x\|_\infty \le 1} \frac{1}{2} (x - y)^T Q (x - y), \tag{6.1}
\]

where $\|\cdot\|_\infty$ denotes the $\ell_\infty$ norm, and $y \in \mathbb{R}^{100}$ and $Q \in \mathbb{R}^{100 \times 100}$ are given. The problem has a natural
coordinate-wise structure, which allows us to apply the CBCG algorithm where each block consists of
a single coordinate. Indeed, by setting $f(\cdot) = \frac{1}{2} (\cdot - y)^T Q (\cdot - y)$ and taking $g_i$ to be the indicator function of $[-1, 1]$, $i = 1, 2, \dots, 100$,
problem (6.1) fits our model (2.1). We generate random instances of problem (6.1) as follows.


• Set d = 100 and n = 200. 

• Generate $X \in \mathbb{R}^{n \times d}$ with standard normal entries.

• Set $Q = \frac{1}{n} X^T D^2 X$ where $D = \mathrm{diag}(1/n^2, 1/(n-1)^2, \dots, 1)$.

• Generate y with standard normal entries. 
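
The generation steps above can be sketched in pure Python as follows. This is a minimal illustration, not the authors' code: the function name, the seed handling and the use of the standard library generator are assumptions, and the exponent of the diagonal entries of $D$ is exposed as a hypothetical parameter `power` because it is an assumption of this sketch.

```python
import random


def generate_instance(d=100, n=200, power=2, seed=0):
    """Return (Q, y) with Q = (1/n) X^T D^2 X and y standard normal.

    D is diagonal with entries 1/n**power, 1/(n-1)**power, ..., 1
    (the value of `power` is an assumption of this sketch).
    """
    rng = random.Random(seed)
    # X has n rows and d columns with independent standard normal entries
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    # squared diagonal entries of D, row r carries weight 1/(n - r)**power
    w2 = [(1.0 / (n - r) ** power) ** 2 for r in range(n)]
    Q = [
        [sum(w2[r] * (X[r][i] * X[r][j]) for r in range(n)) / n for j in range(d)]
        for i in range(d)
    ]
    y = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return Q, y


# Small instance for illustration; the paper's sizes are d = 100 and n = 200.
Q_small, y_small = generate_instance(d=4, n=6, seed=1)
```

By construction $Q$ is symmetric and positive semidefinite, so the resulting instance of (6.1) is convex.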

We compare the standard Conditional Gradient (CG) algorithm [17], the Random Block Conditional
Gradient (RBCG) algorithm [19] (with block chosen uniformly at random), the Cyclic Block
Conditional Gradient where the order of the blocks is fixed for all iterations (CBCG-C) and the Cyclic
Block Conditional Gradient where the order of the blocks is chosen by a random permutation at each
iteration (CBCG-P), which is the method of interest in this paper. For each algorithm, we compare
three different stepsize rules:


• Predefined: the stepsize which is given in (4.3). Note that for RBCG, there is a slight
modification in the definition of the predefined stepsize. There is no notion of a “cycle” in
RBCG and the stepsize has the form $2n/(k + 2n)$ where $k$ is the number of blocks updated so
far, see [19].

• Adaptive with Backtracking: set $A = I$ and compute the stepsize using the backtracking
scheme.

• Exact line search: the stepsize is chosen to minimize the objective as in (5.1). Note that, as
explained in Section 5.1, this corresponds to using the adaptive stepsize (4.12) with $A = DX/\sqrt{n}$,
since we consider a quadratic objective function.


For problems of the form (6.1), the second strategy is questionable since a closed form expression is 


available for the exact line search. However, the comparison to this strategy illustrates differences 


in algorithmic behaviors related to the choice of A in the model (2.1) as well as the performance of 
the backtracking scheme. 

Remark 6.1. For CG and RBCG, a sublinear convergence rate holds for the three stepsize strategies.
For both CBCG-C and CBCG-P, in the case of the predefined and of the adaptive stepsizes,
convergence is ensured by Theorems 4.5 and 4.13, respectively. Furthermore, since we consider quadratic
objective functions, the exact line search stepsize strategy can be seen as a particular case of our adaptive
strategy and the convergence follows from Theorem 4.13 (see also Section 5.1).

Figure 1: Comparison of Conditional Gradient (CG), its random block version (RBCG), its cyclic
block version with fixed block order (CBCG-C) and its cyclic block version with random permutation
block order (CBCG-P). We compare three different stepsize strategies based on 1000 randomly
generated instances of problem (6.1). The central line is the median over the 1000 runs and the
ribbons show 98%, 90%, 80%, 60% and 40% quantiles. For all methods, k represents the number of
effective passes through d coordinates.


We generate 1000 random instances of problem (6.1), denoting by $f_w$ the objective function of
the $w$-th randomly generated problem, $w = 1, 2, \dots, 1000$. For each problem, we run the four
different algorithms with the three different stepsize rules (the initialization is at the origin). The
results for the first 10 iterations are presented in Figure 1. For each algorithm, increasing $k$ by 1
means that $N = 100$ blocks have been queried, randomly for RBCG, sequentially for CBCG and
all at once for CG. Since each objective function is generated randomly, it does not make sense to
directly compare performance across different problems on the same graph. To overcome this, for
each $w = 1, 2, \dots, 1000$, we “center” and “scale” the function values. That is, for each objective
$f_w$, $w = 1, 2, \dots, 1000$, we estimate the optimal value $f_w^*$ of (6.1) by running CBCG-P with exact
line search for 200 more iterations. For such a number of iterations, we observed in preliminary
experiments that the algorithm reached machine precision on random instances of problem (6.1).
The quantity plotted in Figure 1 is given by the following affine transformation:

\[
\frac{f_w(x^k) - f_w^*}{f_w(0) - f_w^*},
\]

so that in Figure 1 the first value is always 1 and the represented quantities are positive and
asymptotically tend to 0. The main comments regarding this experiment are the following:


• For each stepsize rule, CBCG-C has an advantage.

• There is not much difference between the two variants CBCG-C and CBCG-P in this
experiment.

• The predefined stepsize rule leads to much slower convergence.

• Adaptive with backtracking and exact line search rules yield improved convergence speed for
both CBCG and RBCG, which is not the case with CG.

• Exact line search rule has a slight advantage over the backtracking rule.

The last point deserves further comment. Indeed, in model (2.1), there are different possible choices
of matrix A and function F that lead to equivalent problems. Despite being equivalent problems,
different model choices may lead to variations in algorithmic performance when numerically solving
a problem. This is what we observe here. Indeed, as pointed out in Section 5.1, the exact line search
strategy corresponds to choosing an F that is perfectly conditioned. This means that variations of
F around its minimum are isotropic. On the other hand, choosing A = I leads to a choice of F
that is less well-behaved. These numerical experiments suggest that choosing A that corresponds
to a better conditioned F leads to better numerical performance. Finally, the gap between the two
strategies remains small, highlighting the efficiency of the backtracking scheme.


6.2 Structured SVM 

The main motivation for the introduction of the block version of the CG method with random update rule
in [19] is that it leads to a new efficient algorithm for training the structured SVM [30]. We refer
the reader to [19] and the references therein for background on this problem and its relations to the
conditional gradient method. In brief, the structured SVM solves a multi-class classification task. It
is dedicated to problems for which the output classes are embedded in a combinatorial structure such
as trees, graphs or sequences. In this setting the number of classes can be enormous, which results in
optimization problems with an intractable number of linear constraints. For some of these problems,
efficient decoding algorithms can be used as oracles to compute sub-gradients of the structured SVM
problem. They can also be used as oracles to solve the linearized sub-problem required to run the
conditional gradient and block conditional gradient methods on the dual of the structured SVM.

In this section, we briefly recall the mathematical formulation of the dual structured SVM problem
and provide a numerical comparison between random and cyclic update rules for block conditional
gradient in this context. The purpose is not to be exhaustive here and we solely focus on the aspects
of the problem related to optimization. Therefore, we compare the numerical performance of the
two selection rules on this real world example based on an optimization criterion. In this section,
$N$ denotes the number of training examples, $M$ denotes the number of output classes and $d$ is an
integer such that $\mathbb{R}^d$ is a space of tractable size (such that elements of $\mathbb{R}^d$ can be stored in memory).
The dual variable of the structured SVM is a matrix $\alpha \in \mathbb{R}^{N \times M}$ with non-negative entries (this
could actually be refined with example dependent output classes, but we stick to this notation as a
first approximation). The dual problem of the structured SVM can be written as follows

\[
\min_{\alpha \ge 0} \left\{ \frac{\lambda}{2} \| A \alpha \|^2 - \mathrm{Tr}(b^T \alpha) : \alpha \mathbf{1}_M = \mathbf{1}_N \right\}, \tag{6.2}
\]

where $A : \mathbb{R}^{N \times M} \to \mathbb{R}^d$ is a linear map, $b \in \mathbb{R}^{N \times M}$ is a matrix, Tr denotes the trace operator,
and $\mathbf{1}_M$ and $\mathbf{1}_N$ denote the vectors of dimensions $M$ and $N$ in which all entries are 1. Problem (6.2) has an interesting product
structure. Indeed, its feasible set can be viewed as a product of $N$ simplices of dimension $M$. In the
context of structured SVM it is not necessary to store $\alpha$ explicitly; it is sufficient to store and update
$A\alpha \in \mathbb{R}^d$ and $\mathrm{Tr}(b^T \alpha)$. With this information, we can use the specific decoding oracles to solve
simplex constrained linear subproblems and run the conditional gradient algorithm. This constitutes
the main advantage of the method here: it allows one to explore a potentially very large space by using
only implicit conditional gradient steps (recall that $M$ is the size of a set of combinatorial nature;
see [19] for a complete derivation and more details).

We use the code provided by the authors of
[19], which gives the possibility to train the structured SVM using RBCG, CBCG-C (fixed block
order) or CBCG-P (random permutation block order) on the Optical Character Recognition (OCR)
dataset originally proposed in [30]. We consider the random update rule and the cyclic update rule with varying
block order (or random permutation). We only consider the exact line search strategy (5.1). Note
that our convergence analysis can be applied in this setting (see also Section 5.1). It can be checked
(see [19] for details about the structured SVM model) that the constant $C_2$ given by Theorem 4.13
remains bounded away from zero as $N$ grows, in the setting of model (6.2). Therefore, the rate
given in (4.27) suffers from a multiplicative dependence in $N$. From a theoretical point of view, in
the context of large training sets (large $N$), this problematic dependence leaves room for improved
convergence analysis.

Numerical results in terms of the optimality measure $S$ for various values of $\lambda$
are given in Figure 2. As in the synthetic examples of Section 6.1, a cyclic-type update rule has a
slight advantage: the random permutation approach (CBCG-P) gives the best performance here.
In contrast with Section 6.1, however, the random block
selection rule (RBCG) has an advantage over the purely cyclic block selection rule (CBCG-C).
Since our analysis is
the same for both CBCG-C and CBCG-P, it does not explain this empirical difference.

Figure 2: Comparison of RBCG and CBCG-C (purely cyclic) and CBCG-P (random permutation) for
the structured SVM training on the OCR dataset of [30]. The quantity represented is the optimality
measure defined by (3.1) and the heading represents the value of $\lambda$ (from (6.2)). The central line is
the median over 20 runs and the ribbons show 80% and 50% quantiles. For all methods, k represents
the number of effective passes through all the blocks (which correspond to datapoints here).


7 Discussion and Future Work 

7.1 Comparison with existing results 

We have described the CBCG algorithm and provided an explicit sublinear convergence rate estimate
of the form $O(1/k)$, which is canonical for linear oracle based algorithms (up to a multiplicative factor)
[26]. This asymptotic rate cannot be improved in general [10]. A few words on the constants are in
order. For the traditional CG method, the multiplicative constant is proportional to $L_F D^2$ [17].¹ In
order to compare these rates with those obtained for block decomposition methods, we consider in
this discussion that one iteration of a block decomposition method consists in updating $N$ arbitrarily
chosen blocks (which is consistent with the notation in this paper). The multiplicative constant
proposed in [19] for the average case complexity of RBCG is proportional to $(L_F D^2 + H(x^0) - H^*)$
independently of the number of blocks.² The independence of the constant with respect to the
number of blocks is an important property that we do not recover for the cyclic block updating rule.
This behavior is consistent with previous observations comparing random and cyclic update rules
[6], where a multiplicative dependence in the number of blocks also appeared in the convergence rate
analysis of different types of methods.

¹ The constant presented in [17] relates to an affine invariant notion of curvature, but it can be upper bounded by
the term we consider.

² As in the case of CG, the result is not presented with this exact constant, but it is an upper bound that we use
for discussion purposes.

For the sake of clarity, we will discuss the results of Section 4 when $A = I$, all $D_i$ are equal and
$\beta_i = L_F$, $i = 1, 2, \dots, N$. For the predefined stepsize, the quantity $C_1/(L_F D^2)$ ($C_1$ given by (4.4))
grows like $\sqrt{N}$. Furthermore, for the adaptive stepsize, the quantity $N C_2/(L_F D^2)$ ($C_2$ given by
(4.23)) grows like $N^2$. This is much worse than the $\sqrt{N}$ dependence of the predefined strategy,
although practical simulations tend to show that the adaptive rule is much faster. Therefore, there
is room for improvement in the analysis, and this raises the natural question of whether it is possible to
get multiplicative constants that do not depend on the number of blocks (which is in general the case
for random block selection rules). As we have seen in Sections 5.3 and 6.2 for SDCA and structured
SVM applications, the effect is to have a multiplicative dependence of the convergence rate on the
size of the training set. Therefore, although theoretically interesting, the rates we derive are of limited use
in explaining performance on large training sets (large number of blocks in the dual), which is the
motivation for using block decomposition methods in machine learning applications.

However, contrary to the average case complexity estimate of RBCG [19], the exact expression of
the rate is not a direct generalization of that of the CG algorithm. Recall that for the traditional CG method,
the multiplicative constant is proportional to $L_F D^2$ [17]. The constants given in Theorems 4.5 and
4.13 feature additional multiplicative and additive terms. In particular the quantity $C_1/(L_F D^2)$, for
$C_1$ given by (4.4), or the quantity $N C_2/(L_F D^2)$, for $C_2$ given by (4.23), show a multiplicative dependence
on the number of blocks $N$. This behavior is consistent with previous observations comparing random
and cyclic update rules [6]. It is expected that the analysis of cyclic rules leads to worse constants since
they represent worst case complexity analysis. An important theoretical question is whether this can
be avoided for cyclic rules. In other words, is it possible to prove an explicit convergence
rate of the form $M/k$ for the cyclic rule such that $M/(L_F D^2)$ does not depend on the number of
blocks $N$?

For practical applications, however, this remark should be tempered since we are comparing upper
bounds. These bounds may only reflect limitations of the analysis, not of the methods, and their
comparison may not shed much light on the comparative behavior of different rules on practical
problems. Indeed, our numerical experiments on synthetic and real-world examples reproducibly
suggest an advantage of cyclic or random permutation variants over the fully random update rule. This
is again something that has already been observed in other contexts. A question related
to the discussion of the previous paragraph is to give a theoretical justification for this observation
or possibly provide different iteration dependent block update rules that comply with it (see for
example [25] for an illustration in the context of gradient descent).


7.2 Future Directions 


Finally, a natural question is that of the extension of specificities of the conditional gradient method 
in our block decomposition setting. Potential directions include the following: 


• Linear convergence. CG is known to converge linearly when the optimum is in the relative
interior of the feasible set iniEii or when the feasible set is strongly convex and the gradient 
of the smooth part of the objective is non-zero on the feasible set [21].

• Dual interpretation of the block decomposition. Generalized CG is known to implicitly
generate subgradient sequences [1] related to the mirror descent algorithm [3] applied to a dual
problem. Similarly, a dual interpretation of RBCG is in terms of stochastic subgradient [19].


• Generalization of the results to exact line search stepsize strategies. The analysis of
such stepsize strategies is not problematic for CG or RBCG [17, 19]. However, we could not
generalize it for CBCG, except in the quadratic case (see Section 5.1), and thus developing an
analysis for the exact line search strategy is in our opinion an important task. Most of our
analysis collapses without further assumptions, because it is no longer possible to get sufficient
decrease conditions of the type of Lemma 4.8. We therefore expect that a different path needs
to be considered.


• Generalization to inexact oracles. In many practical applications, the search direction
given by the oracle is computed by an algorithm. It is therefore relevant to consider
multiplicative errors in order to use well defined stopping criteria for the inner algorithm. This was
already considered in the random block variant [19].


A Proof of Lemma 2.2


We adapt the standard proof of the descent lemma, see e.g., [7, Proposition A.24]. Under the
assumptions of the lemma, using the fundamental theorem of calculus for line integrals on the
segment $[x, x + U_i h]$, we have

\begin{align*}
f(x + U_i h) &= f(x) + \int_0^1 \langle \nabla f(x + t U_i h), U_i h \rangle \, dt \\
&= f(x) + \langle \nabla f(x), U_i h \rangle + \int_0^1 \langle \nabla f(x + t U_i h) - \nabla f(x), U_i h \rangle \, dt. \tag{A.1}
\end{align*}

We can bound the integrand for any $t \in [0, 1]$ as follows

\begin{align*}
\langle \nabla f(x + t U_i h) - \nabla f(x), U_i h \rangle &= \langle A^T \nabla F(A(x + t U_i h)) - A^T \nabla F(Ax), U_i h \rangle \\
&= \langle \nabla F(A(x + t U_i h)) - \nabla F(Ax), A U_i h \rangle \\
&= \langle \nabla F(Ax + t A_i h) - \nabla F(Ax), A_i h \rangle \\
&\le \| \nabla F(Ax + t A_i h) - \nabla F(Ax) \| \cdot \| A_i h \| \\
&\le t \beta_i \| A_i h \|^2, \tag{A.2}
\end{align*}

where we have used the Cauchy-Schwarz inequality and Assumption 2 to obtain the last two inequalities.
Combining (A.1) and (A.2), we have

\begin{align*}
f(x + U_i h) &\le f(x) + \langle \nabla f(x), U_i h \rangle + \beta_i \| A_i h \|^2 \int_0^1 t \, dt \\
&= f(x) + \langle \nabla_i f(x), h \rangle + \frac{\beta_i}{2} \| A_i h \|^2,
\end{align*}

which proves the desired result.

□